Exploring Red Wine Data by Abdulmajeed Aljaloud

Introduction

In this project, we’ll explore the red wine dataset for interesting trends. The dataset contains around 1,600 instances. Each instance has 11 features and a label giving the quality of the wine as rated by wine experts.

Loading The Data

First we load the data and take a look at the summaries.

## [1] "Variables:"
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## [1] "Date frame:"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## [1] "Summaries:"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
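The loading code itself is not shown in this report. As a minimal base R sketch of how output like the above could be produced, the block below assumes the data lives in a CSV file. Because the real file path is not given here, it writes a temporary CSV built from the first ten rows of a few columns (copied from the preview above) and reads that back. The names `preview`, `csv_path` and `wine` are illustrative.

```r
# Stand-in data: first ten rows of a few columns, copied from the str() preview.
preview <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical file: a temporary CSV standing in for the real dataset file.
csv_path <- tempfile(fileext = ".csv")
write.csv(preview, csv_path, row.names = FALSE)

wine <- read.csv(csv_path)

print("Variables:")
print(names(wine))
str(wine)             # column types and the first few values
print(summary(wine))  # min, quartiles, mean and max per column
print(anyNA(wine))    # TRUE if any value is missing
```

With the full dataset, only the path passed to `read.csv()` would change.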

Univariate Plots Section

The first step is to plot the distribution of each variable. These univariate plots help us understand the structure of the individual variables in our dataset.

The fixed.acidity distribution looks roughly normal with a slight positive skew.

volatile.acidity is also slightly positively skewed, so there might be a relationship with fixed.acidity.

citric.acid seems quite noisy, with very high counts at the 0 and 0.5 values.

residual.sugar is heavily positively skewed: most sugar values are very low.

We should transform this distribution to a log scale to get a better look at it.

This looks more like a skewed normal distribution.
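To make the effect of the log transform concrete, here is a small sketch that compares a moment-based skewness measure before and after taking log10. It uses only the first ten residual.sugar values from the preview above, so the numbers are illustrative and not the skewness of the full dataset. The helper `skewness()` is defined here and is not part of base R.

```r
# First ten residual.sugar values, copied from the str() preview above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Moment-based sample skewness: positive means a longer right tail.
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))

cat("skewness, raw scale:  ", skew_raw, "\n")
cat("skewness, log10 scale:", skew_log, "\n")
```

On the full data, the same idea could be checked visually with `hist(log10(wine$residual.sugar))`, assuming `wine` is the loaded data frame.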

The chlorides distribution is similar to that of residual.sugar, so there might be a relationship here. Let’s see how it looks on a log scale.

On a log scale it no longer looks much like residual.sugar, so the two are probably not related.

free.sulfur.dioxide is positively skewed, but it doesn’t closely resemble any other distribution.

total.sulfur.dioxide has a strong positive skew.

This is interesting: density has an almost perfectly symmetric normal distribution.

The same goes for pH, but with a slight shift to the left. There could be a relationship between the two.

sulphates is slightly positively skewed.

alcohol is also positively skewed, but very noisy. It looks slightly like free.sulfur.dioxide, so this should be investigated.

quality has a roughly normal distribution with some noise, which makes sense given the small number of possible discrete values (0 through 10).

Univariate Analysis

What is the structure of your dataset?

This red wine dataset contains 1,599 instances with 11 features describing the chemical properties of the wine, and 1 output, ‘quality’. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

There are no missing values. It’s also worth noting that the lowest quality score is 3 and the highest is 8, so there are no very bad or very excellent wines, and most wines lie in the middle.

What is/are the main feature(s) of interest in your dataset?

The main feature is obviously the output quality. It would be interesting if we can find a correlation between the quality and one of the other features.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Probably pH. From our Univariate plots, we can see a high resemblance between the two features’ distributions. So there is probably a relationship there.

Did you create any new variables from existing variables in the dataset?

No. It didn’t seem like there was a need for a new variable.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most distributions were positively skewed. This makes sense because most producers would probably try to minimize the chemical values, apart from some outliers.
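One rough way to check the skew claim without the plots is to compare each feature’s mean with its median: a mean above the median suggests a longer right tail. The sketch below uses the quartiles, medians and means copied from the summary table earlier in this report. Dividing the gap by the interquartile range is an illustrative choice made here to put the features on a comparable scale; it is not a standard skewness statistic.

```r
# Values copied from the summary() output earlier in this report.
summ <- data.frame(
  feature = c("fixed.acidity", "volatile.acidity", "citric.acid",
              "residual.sugar", "chlorides", "free.sulfur.dioxide",
              "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  q1     = c(7.10, 0.39, 0.09, 1.9, 0.070, 7, 22, 0.9956, 3.21, 0.55, 9.5),
  median = c(7.90, 0.52, 0.26, 2.2, 0.079, 14, 38, 0.9968, 3.31, 0.62, 10.2),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47,
             0.9967, 3.311, 0.6581, 10.42),
  q3     = c(9.20, 0.64, 0.42, 2.6, 0.090, 21, 62, 0.9978, 3.40, 0.73, 11.1)
)

# Gap between mean and median, in units of the interquartile range.
summ$gap_in_iqr <- (summ$mean - summ$median) / (summ$q3 - summ$q1)

# Largest positive gaps first: these are the most right-skewed by this measure.
ranked <- summ[order(-summ$gap_in_iqr), c("feature", "gap_in_iqr")]
print(ranked, row.names = FALSE)
```

By this measure residual.sugar and chlorides stand out, while density and pH have a mean almost equal to the median, which matches the plots described above.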

The alcohol distribution was probably the most unusual. It is also a bit similar to the free.sulfur.dioxide distribution, so there might be a relationship there.

Some plots were transformed to a semi-log scale because their values were very small and clumped together in a small area. Transforming them to a log scale makes the data easier to visualize.

Bivariate Plots Section

There are three things we should look into.

Plots Against The Output Feature

Here we plot every feature against the output feature and look at summary statistics. Box plots are the best way to represent these relationships because they convey a lot of information compactly for data grouped by category (in this case, the quality rating). The red dots in the plots are the means.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.150   7.500   8.360   9.875  11.600 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.779   8.400  12.500 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.100   7.800   8.167   8.900  15.900 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.000   7.900   8.347   9.400  14.300 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.800   8.872  10.100  15.600 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.250   8.567  10.225  12.600

No clear relationship between fixed.acidity and the quality.
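The per-rating summaries in this section have the shape that base R’s `tapply()` produces when given `summary` as the function. The sketch below shows that pattern on the first ten rows from the data preview, so its numbers will not match the full-data output above. The helper `plot_by_quality()` is a hypothetical name for one way to draw the box plots with red mean markers; the original plotting code is not shown in this report.

```r
# First ten fixed.acidity and quality values, copied from the str() preview.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
quality       <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# One six-number summary per quality rating, printed as $`5`, $`6`, ...
by_quality <- tapply(fixed_acidity, quality, summary)
print(by_quality)

# Group means: the values the red dots mark in the box plots.
group_means <- tapply(fixed_acidity, quality, mean)
print(group_means)

# Hypothetical helper: box plot per rating with the group means overlaid in red.
plot_by_quality <- function(values, quality, label) {
  means <- tapply(values, quality, mean)
  boxplot(values ~ factor(quality), xlab = "quality", ylab = label)
  points(seq_along(means), means, col = "red", pch = 19)
}
```

With the full data frame loaded as `wine`, a call such as `plot_by_quality(wine$fixed.acidity, wine$quality, "fixed.acidity")` would draw one plot per feature.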

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

This is interesting. In general, the lower the volatile.acidity, the higher the quality, so the two are inversely related.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

Some trend here, but not very clear. In general, higher qualities tend to have higher citric.acid.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400

No clear relationship is apparent between residual.sugar and quality.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0610  0.0790  0.0905  0.1225  0.1430  0.2670 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600

No clear relationship is apparent between chlorides and quality.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     6.0    11.0    14.5    34.0 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   12.26   15.00   41.00 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   15.00   16.98   23.00   68.00 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   14.00   15.71   21.00   72.00 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   14.05   18.00   54.00 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00    7.50   13.28   16.50   42.00

This is a bit interesting: across the quality ratings, free.sulfur.dioxide rises and then falls, somewhat like a bell curve. Wines with high free.sulfur.dioxide tend to be ‘average’, whereas the ones with low free.sulfur.dioxide are either ‘good’ or ‘bad’.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.50   27.00   35.02   43.00  289.00 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00

total.sulfur.dioxide behaves similarly to free.sulfur.dioxide. This makes sense because both measure sulfur dioxide: total.sulfur.dioxide probably includes free.sulfur.dioxide, so its distribution is affected by it.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9947  0.9961  0.9976  0.9975  0.9988  1.0008 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9957  0.9965  0.9965  0.9974  1.0010 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9926  0.9962  0.9970  0.9971  0.9979  1.0031 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9954  0.9966  0.9966  0.9979  1.0037 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9948  0.9958  0.9961  0.9974  1.0032 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9908  0.9942  0.9949  0.9952  0.9972  0.9988

Nothing obvious here for density. But the summaries show that, in general, the lower the mean/median, the higher the quality. It’s very subtle, though.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.160   3.312   3.390   3.398   3.495   3.630 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.370   3.382   3.500   3.900 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.280   3.291   3.380   3.780 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.163   3.230   3.267   3.350   3.720

pH is similar to density. It is not very obvious, but as the median/mean decreases, the quality increases.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

In general, the higher the sulphates, the better the quality.

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

alcohol is similar to sulphates. Good quality wines seem to have higher values of alcohol.

Plots of Features With Similar Distributions

In the univariate plots, we noticed some features with similar distributions. In this section we’ll plot these features against each other to see if there is a pattern.

The first two distributions that had a similar shape are residual.sugar and chlorides:

There seem to be a lot of outliers, so we should plot this on a log scale.

Much better. The points are concentrated around the bottom-left. There are some outliers of course, but overall there doesn’t seem to be a relationship.

Next up, alcohol and free.sulfur.dioxide:

The log scale doesn’t seem to make much of a difference. The plot is still scattered with values everywhere, so the two don’t seem related.

Next, fixed.acidity and volatile.acidity:

The points are somewhat concentrated around the middle, but the plot is very scattered and spread out. There doesn’t seem to be a relationship.

Finally, density and pH:

Concentrated in the middle, but a bit scattered. No clear correlation.

This is interesting. Even though these distributions seemed similar in the univariate plots, that doesn’t mean there is a relationship!

Plots of Correlated Features

We should look at the correlation of features, and plot the ones that seem correlated to see how they look.

## [1] "Correlations:"
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00

There isn’t anything that is very highly correlated, but we’ll look at the 4 highest pairs. These are:

  • fixed.acidity and citric.acid: 0.67
  • fixed.acidity and density: 0.67
  • fixed.acidity and pH: -0.68
  • total.sulfur.dioxide and free.sulfur.dioxide: 0.67
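Picking the strongest pairs out of a correlation matrix can be automated. The sketch below applies a 0.6 cutoff on the absolute correlation to a five-feature excerpt of the matrix above, with values copied from that output. On the full data the matrix would presumably come from something like `cor()` on the numeric columns; that call is an assumption, since the original code is not shown.

```r
# Excerpt of the correlation matrix above (values copied from the output).
vars <- c("fixed.acidity", "citric.acid", "density", "pH", "alcohol")
r <- matrix(c(
   1.00,  0.67,  0.67, -0.68, -0.06,
   0.67,  1.00,  0.36, -0.54,  0.11,
   0.67,  0.36,  1.00, -0.34, -0.50,
  -0.68, -0.54, -0.34,  1.00,  0.21,
  -0.06,  0.11, -0.50,  0.21,  1.00
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) and only where |r| exceeds the cutoff.
idx <- which(abs(r) > 0.6 & upper.tri(r), arr.ind = TRUE)

strong <- data.frame(
  feature_a = rownames(r)[idx[, "row"]],
  feature_b = colnames(r)[idx[, "col"]],
  r         = r[idx]
)
print(strong)
```

The excerpt leaves out the sulfur dioxide columns, so only the three fixed.acidity pairs appear here; the total.sulfur.dioxide and free.sulfur.dioxide pair would show up when the same filter is run on the full matrix.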

We can see a slight positive correlation between fixed.acidity and citric.acid here, even though the points are more concentrated around low values. This makes sense because citric acid is an acid and fixed acidity is a measure of acidity.

The correlation between fixed.acidity and density is also obvious here, with concentration around mid values. Most acids are denser than water, so increasing the concentration of acids would increase the density.

The correlation between fixed.acidity and pH is negative. pH is a measure of acidity, but the lower the value, the more acidic the solution, so the negative correlation makes sense.

Even though total.sulfur.dioxide and free.sulfur.dioxide are concentrated around low values, a slight correlation is visible.

Bivariate Analysis

How did the feature(s) of interest vary with other features in the dataset?

volatile.acidity seemed to be inversely related to quality: in general, as quality went up, volatile.acidity went down. This makes sense, because volatile.acidity measures the amount of acetic acid in wine. According to the information provided with the dataset, high levels of acetic acid can lead to an unpleasant, vinegar taste.

The opposite was observed with citric.acid, however. Higher quality wines seem to have higher values of citric.acid. According to the same information, citric acid can add ‘freshness’ and flavor to wines, which explains the higher ratings.

Also, alcohol was positively related to the rating. This was a bit surprising, as I thought people drank wine more for the taste than for the alcohol. It seems, however, that a higher concentration of alcohol corresponds to a higher rating in general.
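The direction of these three trends can be checked from the per-rating means printed in the bivariate summaries. The sketch below computes a Spearman rank correlation between the quality rating (3 to 8) and the group mean of each feature. Note that this is a correlation over six group means, an illustrative shortcut chosen here; it says how consistently the means rise or fall across ratings, not how strongly individual wines follow the trend.

```r
quality <- 3:8

# Mean of each feature per quality rating, copied from the summaries above.
group_means <- data.frame(
  volatile.acidity = c(0.8845, 0.694, 0.577, 0.4975, 0.4039, 0.4233),
  citric.acid      = c(0.1710, 0.1742, 0.2437, 0.2738, 0.3752, 0.3911),
  alcohol          = c(9.955, 10.27, 9.9, 10.63, 11.47, 12.09)
)

# Rank correlation between rating and group mean: the sign gives the direction.
trend <- sapply(group_means, function(m) cor(quality, m, method = "spearman"))
print(round(trend, 2))
```

A negative value for volatile.acidity and positive values for citric.acid and alcohol agree with the observations above.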

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There were multiple pairs of features that had similar distributions, but after investigation, there didn’t seem to be any correlation between them. I then looked at the correlation matrix and investigated pairs with correlations higher than 0.6 (or lower than -0.6). These plots were more promising, as an obvious correlation was observed.

fixed.acidity is positively correlated with citric.acid and density. Both make sense: citric acid is an acid and fixed acidity is a measure of acidity, and most acids are denser than water, so increasing the concentration of acids would increase the density.

fixed.acidity was also correlated with pH, but this time negatively. pH is a measure of acidity from 0 to 14, with 0 being the most acidic, so it makes sense that the pH value would be lower as fixed.acidity gets higher.

Also, total.sulfur.dioxide was positively correlated with free.sulfur.dioxide, which makes sense because free.sulfur.dioxide is a subset of total.sulfur.dioxide.
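Because pH is a logarithmic scale, small differences in pH correspond to large differences in acidity. As a short worked example, the block below uses the textbook approximation that hydrogen ion concentration is 10 raised to the power of minus pH, applied to the minimum, median and maximum pH values from the summary table. The approximation is standard chemistry and not something stated in the dataset documentation.

```r
# pH values taken from the summary table earlier in this report.
ph <- c(min = 2.74, median = 3.31, max = 4.01)

# Approximate hydrogen ion concentration in mol/L: [H+] = 10^(-pH).
hydrogen_ion <- 10^(-ph)
print(hydrogen_ion)

# How many times more acidic the lowest-pH wine is than the highest-pH one.
ratio <- hydrogen_ion[["min"]] / hydrogen_ion[["max"]]
print(ratio)
```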

What was the strongest relationship you found?

The strongest visible positive relationship was between fixed.acidity and density. The strongest negative one was between fixed.acidity and pH.

Multivariate Plots Section

From the previous plots we saw that fixed.acidity was positively correlated with density and negatively correlated with pH. It would be nice to see that in one graph.

As expected, fixed.acidity increases as we move toward the bottom-right corner (increasing density and decreasing pH). We log-scaled the y-axis rather than the x-axis because there is almost no variance in the x-axis: density values range from about 0.99 to 1, so taking a log wouldn’t make any difference. Still, there is almost no visible change, because the y-axis variance is also very low.
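The claim that a log scale barely changes an axis with a narrow range can be checked numerically. The sketch below rescales the five-number summaries of density and pH (copied from the summary table) to the interval 0 to 1, once on the raw scale and once after log10, and reports the largest shift in relative position along the axis. The helpers `rescale()` and `max_shift()` are names introduced here for illustration.

```r
# Five-number summaries copied from the summary table earlier in this report.
density_q <- c(0.9901, 0.9956, 0.9968, 0.9978, 1.0037)
ph_q      <- c(2.74, 3.21, 3.31, 3.40, 4.01)

# Map values to [0, 1], i.e. to their relative position along a plot axis.
rescale <- function(x) (x - min(x)) / (max(x) - min(x))

# Largest change in relative axis position caused by switching to log10.
max_shift <- function(x) max(abs(rescale(log10(x)) - rescale(x)))

density_shift <- max_shift(density_q)
ph_shift      <- max_shift(ph_q)

cat("largest shift on the density axis:", density_shift, "\n")
cat("largest shift on the pH axis:     ", ph_shift, "\n")
```

Both shifts are a small fraction of the axis length, and the density shift is the smaller of the two, which is consistent with the log scale making little visible difference on either axis.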

Now let’s try to find out more interesting things.

In the prior graphs and analysis, citric.acid seemed positively related to quality, while volatile.acidity was negatively related. I would like to see how quality is distributed as a function of these two variables.

Interesting! There is a pattern here. Good quality wines are concentrated at the bottom right of the graph, where citric.acid is high and volatile.acidity is low, and bad quality wines are concentrated at the top left.
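A plot like this one can be sketched in base R by mapping each quality rating to a colour and drawing a scatter plot of two features. The original plotting code is not shown in this report, so the helper names `quality_colours()` and `plot_quality_by()`, the colour choices, and the assumption that ratings run from 3 to 8 (the observed range) are all illustrative.

```r
# Map quality ratings to colours, from red (low) to green (high).
quality_colours <- function(quality, levels = 3:8) {
  palette <- colorRampPalette(c("firebrick", "khaki", "darkgreen"))(length(levels))
  palette[match(quality, levels)]
}

# Hypothetical helper: scatter plot of two features, coloured by quality.
# `df` is assumed to have a `quality` column plus the columns named in x and y.
plot_quality_by <- function(df, x, y, levels = 3:8) {
  plot(df[[x]], df[[y]], col = quality_colours(df$quality, levels),
       pch = 19, xlab = x, ylab = y)
  legend("topright", legend = levels, col = quality_colours(levels, levels),
         pch = 19, title = "quality")
}

# The colour mapping on its own, for three example ratings.
print(quality_colours(c(3, 5, 8)))
```

With the full data frame loaded as `wine`, `plot_quality_by(wine, "citric.acid", "volatile.acidity")` would put citric.acid on the x-axis and volatile.acidity on the y-axis, matching the orientation described above.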

Alcohol seemed like an important feature of good wine too. So let’s see the distribution of quality against citric.acid and alcohol.

As expected. Higher levels of citric.acid and alcohol correspond to better quality in general.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The two most important features were alcohol and citric.acid. As we saw in the plots, higher levels of citric.acid and alcohol correspond to better quality in general.

Were there any interesting or surprising interactions between features?

It was interesting that as quality went up, both citric.acid and fixed.acidity went up as well, even though higher volatile.acidity is associated with lower quality. This suggests that fixed.acidity is not heavily affected by volatile.acidity.


Final Plots and Summary

Plot One

Description One

Plot of the distribution of volatile.acidity by quality. It shows that, in general, quality decreases as volatile.acidity increases.

Plot Two

Description Two

This plot shows how fixed.acidity varies with both pH and density. We can easily see the positive correlation with density and the negative correlation with pH.

Plot Three

Description Three

This plot shows how quality varies with both alcohol and citric.acid. A general trend is seen where quality is low when both values are low, and highest when either is high.


Reflection

The Red Wine dataset contains almost 1,600 rows of wine samples that have been tested by at least 3 wine experts. Each row contains 11 features that describe the chemical properties of the wine sample. The data was collected around the year 2009. I started by trying to understand each individual feature by looking at its distribution. As I gained some insight, I moved on and made plots using each feature to gain more information. I noticed similarities between some features’ distributions, so I explored those relationships. Lastly, I explored the relationship between the quality feature and every other feature to try to find a pattern or an indication of a good wine, based on its chemical values.

There was an obvious relationship between fixed acidity and both density and pH. This can be explained using chemistry, as acids are denser than water and have low pH values by definition. There was also a fairly clear trend between the citric acid concentration and the quality. According to the information provided with the dataset, citric acid can add ‘freshness’ to wine, so the correlation makes sense. However, I was surprised to find that alcohol is also positively correlated with good wines.

One limitation of this dataset is its size. 1,600 samples is not large enough to be a representative sample. Therefore, there might be some bias towards certain types of wine with specific kinds of features. Also, the dataset was collected in 2009. There are probably newer ways to make wine now, which might alter its components, so this dataset might not be a good representation of how wine is nowadays. In the future we should re-run this analysis on a larger dataset that is more representative of the population. The results and trends of this analysis could then be compared with those of the new one.